Recovering Internet Service Sessions from Operating System Failures Motivation and Approach
Authors
Abstract
Critical Internet services such as e-commerce, online auctions, and banking run on complex, multi-tier architectures built with commodity (off-the-shelf) machines and operating systems. These stateful services are sensitive to server failures: active client sessions on these servers are lost, although the state associated with them might still be intact in a failed machine’s memory. We developed a recovery approach that exploits hardware and software redundancy in Internet service installations to reuse active clients’ session state after OS failures (http://discolab.rutgers.edu/bda). Our lightweight, application-independent system provides both failure detection and recovery, for use with complex, multi-tier Internet services. The core of the system is the novel Backdoors (BD) architecture, which uses commodity programmable network interface cards (NICs) with specialized firmware and OS extensions to provide remote access to lightweight application and OS state in a machine’s memory without relying on its OS or processors. Using BD, machines in an Internet server cluster can cooperatively observe each other’s health, detect failures, and take over client sessions from failed nodes. In this article, we describe the BD architecture and our OS extensions for monitoring and recovery of service sessions. We have implemented a prototype in the FreeBSD 4.8 kernel, using Myrinet LanaiXP programmable NICs (www.myri.com). The results from our experiments with the Rice University Bidding System (Rubis; http://rubis.objectweb.org), a cluster-based
Similar Resources
Nonintrusive Failure Detection and Recovery for Internet Services Using Backdoors
We describe an architecture for nonintrusive failure detection and recovery in a cluster of Internet servers in which nodes mutually monitor their liveness and recover client sessions from failed nodes. The system is based on Backdoors, a novel architectural approach for remote healing of computer systems. Backdoors enables monitoring and recovery/repair of state in a computer system by remote ...
Surviving Internet Catastrophes
In this paper, we propose a new approach for designing distributed systems to survive Internet catastrophes called informed replication, and demonstrate this approach with the design and evaluation of a cooperative backup system called the Phoenix Recovery Service. Informed replication uses a model of correlated failures to exploit software diversity. The key observation that makes our approach...
Recovering from Faulty Device Drivers
Several studies (see Swift et al.’s study of Windows XP in SOSP 2003 and Chou et al.’s study of Linux in SOSP 2001) have attributed a large fraction of operating system failures to device driver flaws. Not only can driver errors cause kernel instability, but these errors can also be exploited for privilege escalation and access to kernel data structures. A search on securityfocus.com shows vul...
Service Continuations: An Operating System Mechanism for Dynamic Migration of Internet Service Sessions
We propose service continuations (SC), an OS mechanism that supports seamless dynamic migration of Internet service sessions between cooperating multi-process servers. Service continuations provide a server application with a simple, easy-to-use abstraction and a means to migrate the service state along with the serviced connection. SC supports transparent resumption of service to the clien...
Microreboot - A Technique for Cheap Recovery
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting – a fine-grain technique for s...